Prosper Loan Exploratory Data Analysis by Alex Kettell


Load the Prosper loan data and check the dimensions.

## [1] 113937     81

Prosper loan data set from 2014 intro:

I have chosen to explore the Prosper loan data that contains 113,937 records - 1 for each unique loan - across 81 distinct variables. Prosper is a peer-to-peer loan platform that allows people with excess savings to invest notional amounts in consumer grade loans with the idea that investing in a broad cross section of loans limits the overall risk for the investor. It is assumed that this data is composed of a random sample of anonymized loans across the platform in 2014.

Univariate Plots Section

Run summaries on and plot the distributions of some of the key variables from the dataset.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   36.00   36.00   40.83   36.00   60.00

It looks like the vast majority of loans in this data have a defined term of 36 months, with a smaller number (~ 25K) having a defined term of 60 months. It may be interesting further along in this analysis to group the loans by term and run separate analysis of loans of each term length. A very comparatively small number of loans have a term length of 12 months and I will seek to eliminate these loans from any analysis so as not to skew the data. Term almost becomes a categorical variable in this sense, since it is not of a continuous nature.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   660.0   680.0   685.6   720.0   880.0     591

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    19.0   679.0   699.0   704.6   739.0   899.0     591

Looking at the distribution of credit scores for borrowers, we can see that the ‘RangeLower’ and ‘RangeUpper’ scores have different distributions. the unimodal variance in the data occur at different points along the x axis with the upper range showing a consistent plateau between 650 and 750 other than an abrupt ‘peak’ around 750. the lower range has a similar plateau but the peak occurs at the left most side at 650. both score ranges have 591 NA values which have been omitted from the plots, as have the scores below 400. I will choose to work with the lower range scores because this is generally what banks do when considering loans to borrowers.

Also, when applying for a loan on Prosper, many of the criteria are self-reported and/or unverified, while others are likely not even mandatory to include. I would imagine that scores reported by consumer credit reporing bureaus are not in any of these categories and are mandatory as part of the application process. in that sense, it is one of the most reliable indicators - for better or worse - of an applicant’s creditworthiness. I plan to use this variable in further multivariate analysis.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   26.00   67.00   96.07  137.00  755.00    7625

The data for employment duration in months is clearly right skewed in the first plot. at first this suprised me, but then it occurred to me that it may be difficult to maintain continuous employment for periods in excess of 10 years. whether this means at the same job is also another question worth asking: does that explain the higher counts of loans to individuals with durations of employment between 0 - 5 years. this may also speak to the fact that borrowers will less employment stability may apply for these types of consumer loans at a higher clip than other segments of the population. this variable may be interesting for further bivariate analysis.

In the second plot I log transformed the long-tailed employment duration data to better view the borrower distribution. it is now more normally distributed with a peak near 100 months.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

The original loan amount varies between a min of $1,000, a max of $35,000 and a median of $6,500. most loans are for $10,000 or less. the data are right skewed because of the fewer number of loans for much higher relative amounts.

we see peaks of loans for rounded dollar amounts ie. $10,000; $15,000; $20,000 and $25,000. this is likely particular to the Prosper loan platform as they would encourage borrowers to apply for loans at a rounded amount. that way, many investors can automate lending of smaller amounts across many loans and each loan can be split up without any ‘leftover’ amount. my experience has shown that investors typically fund loans in $25, $50 or $100 increments.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00   44.00   80.48  115.00 1189.00

I wanted to check the distribution of several categorical variables before moving on to bivariate analysis. most borrowers are employed in the categories of ‘Employed’ or ‘Full Time’. homeownership is pretty evenly split between True and False, which is interesting. and the broad cross-section of loans shows the different state that the various loans are in at the time the data was pulled. we see that many loans are in a ‘Current’ or ‘Completed’ state and that there are various stages of ‘Past Due’ before the loans are in default and ultimately charged off. a large portion of the loans have not had their borrowers rated according to the Prosper rating system with the stipulation that this has been applied to loans that originated after July 2009.

I may chose to use some of these categorical variables in my bivariate analysis of borrowers and their associated loans.

Univariate Analysis

What is the structure of your dataset?

The dataset consists of almost 114K unique loan profiles across 82 variables.

What is/are the main feature(s) of interest in your dataset?

Many of the variables across the dataset profile the borrowers who apply for and have their loans invested in across the Prosper lending platform. various factors end up contributing to the interest rate & annual percentage rate that the borrower is required to pay at regular monthly intervals over the term of the loan.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Credit score, grade, debt to income ratio and other categorical variables I believe contribute to Prosper’s own scoring of the borrower, which in turn is reflected in the borrower’s APR /interest rate. I believe the external credit scoring contributes the most to the interest rate on the loan, but will seek to verify this with my analysis.

Did you create any new variables from existing variables in the dataset?

I have not up to this point.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I have log-transformed the right-skewed data for employment duration in months and the number of investors per loan. the transformed employment duration data revealed a center mass around 100 months of continuous employment. whether this just means at the same exact job or over successive jobs is unknown.

the transformed data for number of investors per loan revealed that there is an extraordinarily high number of loans (~ 30K) funded by a single investor for each loan. otherwise, the data are nearly normally distributed with a center mass close to 100 investors per loan. I will seek to visualize the distribution of loans by number of investors in a more robust way.

Bivariate Plots Section

In order to mix a single continuous variable with a single categorical variable the above frequency polygons display the relative distribution of loans by borrower interest rate split by employment status and homeownership. nothing surprising here with most loans going to employed borrowers and those borrowers that went into more detail to indicate that they are employed full-time (possibly in non-exempt positions entitled to overtime pay).

Homeowners look to receive a greater percentage of loans with lower interest rates than non-homeowners, especially when the rate is below 18% where there is a crossover point. above the 18% threshold we see that loans to non-homeowners outnumber loans to homeowners. this is interesting.

I would like to look some more at borrower interest rate and some other variables that may be strongly correlated.

It makes sense for the borrower interest rate to be near perfectly linear to the lender yield, since interest rate = yield less servicing fees. there is slight shadowing below the primary line due to the loan servicing fees.

the second plot is a bit more difficult to interpret since the estimated effective yield accounts for expected losses from charge-offs. here we see several linear bands in the data where the borrower rate is positive and the estimated yield is actually negative! so when including expected losses and servicing fees the investors in these loans either already have or can expect to lose money. the bands tell me that the Prosper rating system likely groups these loans into risk categories and assigns expected yield based on risk.

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and CreditScoreRangeLower
## t = -175.17, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4661358 -0.4569730
## sample estimates:
##        cor 
## -0.4615667
## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and CreditScoreRangeUpper
## t = -175.17, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4661358 -0.4569730
## sample estimates:
##        cor 
## -0.4615667

To my surprise there is a fairly weak negative correlation (r^2 = -0.46157) between borrower credit score and interest rate. a negative correlation at all would make sense because the higher a borrower’s credit score is, the more creditworthy they are considered, and the better (lower) an interest rate they should get on a loan. clearly there are other factors affecting borrower interest rates.

Credit score is bunched in a ‘cloud’ formation mostly between 650 and 850. we can distinguish vertical bands along integers in the interest rate expressed as a percentage and a range of credit scores. there are also a string of faded data points along the bottom where it doesn’t seem Prosper was able to acquire an individual’s credit score and so must have used another metric to gauge risk.

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and DebtToIncomeRatio
## t = 20.465, df = 105380, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.05690080 0.06892819
## sample estimates:
##        cor 
## 0.06291678

Limiting debt:income ratio to account for 99% of values we see that no linear relationship exists between debt:income and borrower interest rate. this is confirmed with an r^2 value of 0.0629. in the multivariate plots section we could see if this is due to the debt levels carried by homeowners in particular.

I chose to eliminate the top 1% of debt:income ratio because of the disclaimer in the variables file over debt levels listed as 10x income. these values are most assuredly outliers that skew otherwise insightful results when working with debt:income ratio.

## # A tibble: 15 x 4
##    StatedMonthlyIncome debt_income_mean debt_income_median     n
##                  <dbl>            <dbl>              <dbl> <int>
##  1              0.0833            NA                  NA     251
##  2              0.25              10.0                10.0     1
##  3              0.833             10.0                10.0     1
##  4              1.42              10.0                10.0     1
##  5              1.67              NA                  NA       1
##  6              1.83              10.0                10.0     2
##  7              1.92              10.0                10.0     1
##  8              2.08              10.0                10.0     1
##  9              2.17              NA                  NA       1
## 10              2.33              10.0                10.0     1
## 11              2.42              10.0                10.0     3
## 12              2.5                6.83               10.0     3
## 13              2.58              NA                  NA       1
## 14              2.67              10.0                10.0     1
## 15              2.92              NA                  NA       6

Once again for this plot I limited the y-variable to 99% of observations because of instances where a placeholder value of 10.01 exists. this also keeps the y-axis value below a 1:1 debt to income ratio ie. a debt:income ratio of 1.0

I have also log transformed the self-reported monthly income on the x-axis to make the data displayable. we see many different debt:income ratio values for monthly incomes below $10,000 with relatively lower ratios for incomes abve $10,000.

## # A tibble: 2 x 4
##   IsBorrowerHomeowner debt_home_mean debt_home_median     n
##   <fct>                        <dbl>            <dbl> <int>
## 1 False                        0.276             0.21 51206
## 2 True                         0.276             0.22 54177

Here I wanted to check the difference in conditional mean & median debt:income ratio for homeowners v non-homeowners and was able to do so after figuring out to drop_na() values from the ratio prior to the group_by() homeownership. Suprisingly, the mean and median debt:income ratios for each category are nearly equal. I’m not going to bother plotting this conditional mean and median for now.

BOX PLOTS FOR HOMEOWNERSHIP : credit score & borrower rate

## prosperLoan$IsBorrowerHomeowner: False
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   640.0   680.0   665.8   700.0   860.0     591 
## -------------------------------------------------------- 
## prosperLoan$IsBorrowerHomeowner: True
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   660.0   700.0   704.8   740.0   880.0

## prosperLoan$IsBorrowerHomeowner: False
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1449  0.1980  0.2029  0.2624  0.4975 
## -------------------------------------------------------- 
## prosperLoan$IsBorrowerHomeowner: True
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1239  0.1700  0.1828  0.2394  0.3600

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Throughout my bivariate explorations I have been trying examine the variables that contribute most to a borrower’s assigned interest rate. I have not found any strong linear correlations that affect interest rate as a response variable.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I found an unexpected equivalence between the debt:income ratios of homeowners and non-homeowners. originally I had suspected that homeowners would carry a much higher debtload with a mortgage and that this would have a noticeable effect on the debt:income ratio. that was not the case and leads me to question how the debt:income ratio is calculated.

What was the strongest relationship you found?

The strongest relationship I encountered between 2 variables was between credit score (lower range) and borrower interest rate. even as the strongest I found, this correlation was still weak with an r^2 value of -0.46157

the other realtionships that I plotted between borrower rate and lender yield showed expected strong relationships since the point of the Prosper platform is to provide a close spread between interest rate and yield. if the spread were to wide, Prosper would not survive as a company.

Multivariate Plots Section

Here we see the line plots for conditional summary stats (mean & median) for borrower interest rate by credit score value split between homeownership status. notice that credit scores are generally higher at both mean and median interest rate points for homeowners vs. non-homeowners. mean & median credit scores for non-homeowners are capped below 800.

Above are the line plots for conditional means of borrower interest rate by credit score value greater than 600 (loans with lower than a 600 credit score were all in an undefined ratings bucket) split between the Prosper rating factors. we clearly see strata of loans defined by flat lines along certain mean interest rate values with the reddish line for loans with an undefined Prosper rating decreasing along the x-axis as credit score increases along the y-axis. these are likely loans funded prior to July 2009 and hence external to the Prosper ratings system.

I converted loan term to a factor variable and did a quick split of mean borrower interest rate by credit score value. it is quite apparent that interest rates are higher for longer termed loans given a constant credit score value. that makes sense given that long-term interest rates are most often higher than short & medium-term interest when the US economy is functioning properly. there are several points in the above plot where 36-month and 60-month interest rates are slightly inverted (where shorter term mean rates are higher than longer mean term rates).

This multivariate plot is a scatterplot of mean borrower interest rate responding to log-transformed employment duration in months. this data is then split and colored by employment status in order to see if we can pick out any trends. a much tighter cluster of points is found towards the right side of the x-axis signaling great number of loans to people with between 100 and ~800 months of continuous, full time employment or self-employment. we see the other employment status signifiers – such as ‘Retired’ or ‘Not employed’ – on the outskirts of the main cluster. noticeably, ‘Retired’ status loans have lower mean interest rates than the ‘Not employed’ classified loans.

## 
## Calls:
## mod1: lm(formula = I(BorrowerRate) ~ I(CreditScoreRangeLower), data = prosperLoan)
## mod2: lm(formula = I(BorrowerRate) ~ I(CreditScoreRangeLower) + EmploymentStatusDuration, 
##     data = prosperLoan)
## mod3: lm(formula = I(BorrowerRate) ~ I(CreditScoreRangeLower) + EmploymentStatusDuration + 
##     DebtToIncomeRatio, data = prosperLoan)
## mod4: lm(formula = I(BorrowerRate) ~ I(CreditScoreRangeLower) + EmploymentStatusDuration + 
##     DebtToIncomeRatio + LoanOriginalAmount, data = prosperLoan)
## 
## ============================================================================================
##                                  mod1            mod2            mod3            mod4       
## --------------------------------------------------------------------------------------------
##   (Intercept)                     0.549***        0.655***        0.659***        0.620***  
##                                  (0.002)         (0.002)         (0.002)         (0.002)    
##   I(CreditScoreRangeLower)       -0.001***       -0.001***       -0.001***       -0.001***  
##                                  (0.000)         (0.000)         (0.000)         (0.000)    
##   EmploymentStatusDuration                        0.000***        0.000***        0.000***  
##                                                  (0.000)         (0.000)         (0.000)    
##   DebtToIncomeRatio                                               0.008***        0.008***  
##                                                                  (0.000)         (0.000)    
##   LoanOriginalAmount                                                             -0.000***  
##                                                                                  (0.000)    
## --------------------------------------------------------------------------------------------
##   R-squared                       0.213           0.252           0.275           0.306     
##   adj. R-squared                  0.213           0.252           0.275           0.306     
##   sigma                           0.066           0.065           0.063           0.062     
##   F                           30684.346       17864.461       12398.484       10778.542     
##   p                               0.000           0.000           0.000           0.000     
##   Log-likelihood             146748.641      139635.254      131124.703      133223.759     
##   Deviance                      498.165         450.051         393.321         376.809     
##   AIC                       -293491.283     -279262.507     -262239.407     -266435.518     
##   BIC                       -293462.368     -279224.211     -262191.949     -266378.569     
##   N                          113346          106312           97888           97888         
## ============================================================================================

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

In this section we have observed multiple variables as explanatory of fluctuations in loan interest rates among the sample population of borrowers. First, we saw that homeownership generally tracks to a higher mean /median interest rate given a specific credit score value, and vice versa. then we took a look a Prosper’s own loan ratings system and saw how borrower interest rates stay relatively close to specific mean interest rate values along the credit score spectrum. we even saw how some mean interest rates in certain ratings catergories – such as D & E – actually go up as credit scores improve! following that we looked at mean interest rates by credit score and confirmed that longer term interest rates are higher than shorter term rates holding credit scores constant, but that rates trend downward across all loan terms as credit ratings improve.

Were there any interesting or surprising interactions between features?

What is interesting in all of this is that no one explanatory variable beyond credit score can correlate explicitly to the determination of a borrowers interest rate, and that even in certain edge cases the interest rate will increase as credit score improves; which is the inverse from what we would expect.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

The R-squared values in my linear model that attempt to capture the effect of mutiple variables on borrower interest rate are quite weak. clearly there are not any intuitive relationships between interest rate and any of the variables that I picked out of the 82 possibilities. I had originally tried to pick variables which I thought would reveal a correlative relationship. maybe I should have asked an actuary or a loan underwriter.


Final Plots and Summary

Plot One

Description One

Final Plot 1 is a scatterplot of borrower interest rate for credit scores above 0 for different levels in the Prosper ratings system. as expected we see stronger ratings (A, AA, B) clustered below the trendline and to the right of the x-axis for credit score. the weaker categories (C, D, E, HR) are above the smoothed mean trendline, but are also present for higher values of Credit score along the x-axis.

We also see a dark brown cluster of data to the left both above and below the mean trendline. the color signifies loans issued prior to July 2009 that were not rated by Prosper. for some interesting reason these loans are mostly for lower sub-prime credit profiles according to the FICO credit scores.

Plot Two

Description Two

Building on the previous plot we can split the data further by accounting for homeownership status where the left plot shows non-homeowners and the right plot is for homeowners. I would have hoped that homeownership status, and hence carrying a mortgage if the house is not already paid off, would have affected credit to interest rate composition. what we can notice is that the mean trendline of non-homeowners is slightly more flat than that for homeowners; meaning that interest rates are slightly less responsive to changes in credit scores for one category (non-owner) vs. the other (owners).

The loan composition for each category is largely similar, though we do notice that there are more ‘unrated’ loans for non-owners and more ‘high-risk’ loans for homeowners.

Plot Three

Description Three

Here our final plot displays borrower interest rate and debt:income similar to the bivariate plot that was created earlier. here we see that the loan data can we further visualized to display self-reported income levels via a qualitative color scale. we see that while incomes vary across the portfolio of loans, those borrowers that report incomes in the $74,999 range and below tend to carry higher debt to income levels. we can only assume that debt:income ratios have been calculated exclusive of the loan in the loan application and once the loan has been approved and funded, borrowers would see an increase in debt holding income constant.

though a majority of loans are listed as ‘Income Verifiable’ we still see a number of data points where income is ‘Not displayed’. these loans tend to carry exhorbitant interest rates.


Reflection

The Prosper loan dataset contains a wealth of information across nearly 114,000 consumer loans issued pior to 2014, with some variables containing descriptions dating the loans back before July 2009. the univariate exploration enabled me to understand the structure of the dataset as well as some key variables that based on my experience with a portfolio of consumer loans on a similar platform. subsequent bivariate and multivariate explorations led to some interesting findings of the near total lack of strong relationships between borrower interest rate – a key response variable along with APR and lender /investor yield – and the selected explanatory variables. investors in loans make decisions on which loans to fund based on interest rate and perceived risk.

I was most surprised that borrower credit score did not have a stronger correlation to interest rate, as the FICO credit score is the most objective and comprehensive 3rd party sourced piece of evaluatory information that Prosper collects on each individual borrower. I then explored interest rate relative to debt to income ratio and did not find a strong relationship at all. homeownership was another binary factor variable that I thought would have an outsize effect on debt to income ration and hence interest rate, but that did not seem to be the case.

Finally I relented and introduced the proprietary Prosper Rating into my analysis which classified loans based on risk factors not supposedly beyond just credit score. credit score proved to be the strongest related variable to borrower interest rate and slicing these loans up by Prosper rating showed clear stratification between loan segments and the interest rate – by extension yield /return – assigned at the approval for funding of the loan.

Given all of the data that was provided I was slightly disappointed in my inability to find tha ‘secret sauce’ that Prosper uses to risk assess these loans. at the end of the day I’m left to believe that Prosper uses credit score and maybe a few other weighted variables and calls it a day. for further analysis I would have liked to use a uniform dataset that did not include a sizeable portion of loans that were left unclassified for certain factor variables or had missing numeric values for continuous variables. I would have much preferred an updated real life dataset to perform this analysis.

One thing I have learned is that I would not engage as an investor /lender on Prosper’s platform for the purposes of yielding a market-beating return on my investment. I’d bet losses and chargeoffs are quite underestimated.